Optimization issues in inverted index-based entity annotation

Authors

Ganesh Ramakrishnan

Sachindra Joshi

Sanjeet Khaitan

Sreeram Balakrishnan

Abstract

Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on ‘John’s address’ that returns matches to all entities annotated as an address that co-occur with ‘John’. A dominant paradigm adopted by rule-based named entity annotators is to annotate one document at a time. The complexity of this approach varies linearly with the number of documents and the cost of annotating each document, which could be prohibitive for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16] operates on the inverted index of a document collection and achieves an order of magnitude speed-up over the document-based counterpart. In addition, the index-based approach permits collection-level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial-time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules, and we provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor of up to 2 over the baseline system without optimizations.
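To make the idea of choosing among logically equivalent evaluation plans concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: it assumes a rule is a plain conjunction of terms, uses a hypothetical cost model (each merge step is charged the sizes of its two inputs, with terms assumed to occur independently), and simply enumerates all orderings rather than running in polynomial time. The names `build_index`, `estimated_cost`, `best_plan`, and `evaluate` are illustrative only.

```python
from itertools import permutations


def build_index(docs):
    """Map each token to the sorted list of ids of documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in set(text.lower().split()):
            index.setdefault(token, []).append(doc_id)
    return index


def estimated_cost(order, index, n_docs):
    """Estimate the work of intersecting posting lists in the given order.

    Assumption (not from the paper): terms occur independently, so the
    expected size of an intermediate result shrinks by the fraction of
    documents that contain the next term. n_docs must be positive.
    """
    size = len(index.get(order[0], []))
    cost = 0.0
    for term in order[1:]:
        plist_len = len(index.get(term, []))
        cost += size + plist_len          # charge both inputs of the merge
        size = size * plist_len / n_docs  # expected size of the intersection
    return cost


def best_plan(terms, index, n_docs):
    """Pick the cheapest ordering among all logically equivalent ones."""
    return min(permutations(terms),
               key=lambda order: estimated_cost(order, index, n_docs))


def evaluate(order, index):
    """Run the plan: intersect posting lists in the chosen order."""
    result = set(index.get(order[0], []))
    for term in order[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

Every ordering returns the same set of documents; only the estimated amount of work differs. Under this cost model the frequent term is intersected last, so the intermediate results stay small.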


Similar resources

Entity Annotation based on Inverse Index Operations

Entity annotation involves attaching a label such as ‘name’ or ‘organization’ to a sequence of tokens in a document. All the current rule-based and machine learning-based approaches for this task operate at the document level. We present a new and generic approach to entity annotation which uses the inverse index typically created for rapid keyword-based searching of a document collection. We d...


Strategies for Large-Scale Entity Resolution Based on Inverted Index Data Partitioning

Inverted indexing is a commonly used technique for improving the performance of entity resolution algorithms by reducing the number of pair-wise comparisons necessary to arrive at acceptable results. This chapter describes how inverted indexing can also be used as a data partitioning strategy to perform entity resolution on large datasets in a distributed processing environment. This chapter di...


Sniper: A search engine for domain semantic knowledge

This paper presents Sniper, a knowledge-based computer field search engine in Semantic Web. Sniper takes WordNet as background ontology and integrates the entities in the semantic documents by mapping them to the synsets of WordNet. Sniper returns the most related knowledge in computer field as result according to user's query. The search results of Sniper are displayed in the form of list of e...


Implementation and Optimization of Annotation and Interpretation Step of Next-Generation Sequencing Data for Non-Syndromic Autosomal Recessive Hearing Loss

Introduction: The precision and time required for analysis of data in next-generation sequencing (NGS) depends on many factors including the tools utilized for alignment, variant calling, annotation and filtering of variants, personnel expertise in data analysis and interpretation, and computational capacity of the lab and its optimization is a challenging task. Method: An application software...


Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...




Publication date: 2008
